Standards-based education reform has a history of more than 20
years. A standards-based vision was enacted in federal law under
the Clinton administration with the 1994 reauthorization of the
Elementary and Secondary Education Act (ESEA) and carried forward
under the Bush administration with the No Child Left Behind Act
(NCLB) of 2001. In a recent survey of policy makers, standards
were acknowledged as the central framework guiding state
education policy.

Yet, despite this apparent unanimity about the intuitively 
appealing idea of standards, there is great confusion 
about its operational meaning: exactly what should the 
standards be, how should they be set and by whom, and 
how should they be applied to ensure rigorous and high- 
quality education for American students are the central 
questions that challenge policy makers and educators. 
For example, content standards (subject-matter descriptions
of what students should know and be able to do)
are often confused with performance standards (which 
are more like passing scores on a test), and very different 
theories of action are used to explain how standards- 
based reforms are expected to work. Ambitious rhetoric 
has called for systemic reform and profound changes in 
curriculum and assessments to enable higher levels of 
learning. In reality, however, implementation of stan- 
dards has frequently resulted in a much more familiar 
policy of test-based accountability, whereby test items 
often become crude proxies for the standards. 

This disconnect between rhetoric and reality is one of the 
reasons for the failure of prior reforms. For example,
early advocates of standards-based reforms were reacting 
against previous efforts focused on minimum competen- 
cies (such as balancing a checkbook) that had done little 
to improve the quality of instruction or student learning. 
To fight against low expectations and an incoherent, de 
facto curriculum driven by textbooks and basic skills 
tests, they called for clear and challenging content stan- 
dards and a coherent structure of state leadership that 
would provide long-term support to enable more fundamental
changes in instruction. In Promises to Keep:
Creating High Standards for American Students, a panel
of policy makers and academics explained that raising 
performance in the way envisioned would require a systematic
and sustained effort from all levels of the education
system. Clear and visible standards, identifying
what students should know and be able to do, would
need to be reinforced by “curricula, teacher training,
instructional materials, and assessment practices to enable
students to meet them.”

A National Academy of Education Panel on Standards- 
Based Education Reform concurred that findings from 
cognitive science research make it at least theoretically 
possible to focus instruction on depth of understanding, 
and to provide the support for reaching a much more di- 
verse population of students. But, the report cautioned 
that extrapolating from small-scale, intensive studies to 
full-system reform was an unprecedented task — one that 
would require significant investments in teacher profes- 
sional development and ongoing evaluations to improve 
the system. 

Basic elements of the standards vision were established 
in the 1994 ESEA. The law required that states set chal- 
lenging and rigorous content standards for all students 
and develop assessments, aligned with the standards, to 
measure student progress. By holding schools account- 
able for meeting the standards, it was expected that 
teachers and actors at other levels of the educational sys- 
tem would redirect their efforts and find ways to improve 
student achievement. In contrast to the hoped-for idea of 
coherent capacity building envisioned for the Goals 2000:
Educate America Act of 1994, passed earlier that same 
year, ESEA set forth primarily an incentives theory of 
change. It assumed that — with sufficient motivation — 
teachers (and other relevant school personnel) would 
find the means to improve instruction. Unfortunately, 
early implementation research showed that many schools 
lacked an understanding of the changes that were 
needed, and also lacked the capacity to make them happen.
NCLB intensified the commitment to leverage
change through test-based accountability, and in at least 
two significant ways, “upped the ante”: (1) its focus on 
disaggregating data (to make it possible to track the per- 
formance of various subgroups such as major ra- 
cial/ethnic groups or those with limited English profi- 
ciency) reflected the widespread belief that the achieve- 
ment gap was an unacceptable reality in American edu- 
cation and that hard data demonstrating inequality of 
outcomes would be necessary, if not sufficient, to rem- 
edy the situation; and (2) its urgency — as reflected in the 
requirement that all students reach a certain standard of 
performance by 2014 — spoke to the growing frustration 
across the nation about the slow pace of progress. The
current policy context, therefore, is best understood as a
blend of standards rhetoric and test-based accountability 
practices. 

Research findings about the effects of standards and test-based
accountability have been both promising and disappointing.
Educators have redirected efforts as intended,
adopting curricula aligned with state standards
and dramatically increasing the amount of instructional
time devoted to reading and mathematics. Accountability
pressures have resulted in increased use of test data to
redirect instructional efforts, extensive test preparation
practices, and increasing use of interim and benchmark
tests administered periodically to monitor progress toward
mastery of standards.

Although it is difficult to tie achievement results to specific
policies, some researchers have found positive links
between stronger state accountability policies and
relatively greater improvement on the National Assessment
of Educational Progress (NAEP). For example,
one study found that from 1996 to 2000 “high-accountability
states” had relatively greater gains on
NAEP in mathematics for eighth grade and for African-American
and Hispanic fourth graders scoring at the Basic
Level. Positive findings for test-based accountability
have been partially confirmed by other researchers. Using
NAEP data from 1992 to 2002, one study found higher
achievement associated with accountability but no narrowing
of the Black-White achievement gap. Positive
effects are also challenged by evidence that some states
may have excluded large numbers of students from
NAEP testing or by evidence that gains were weaker or
nonexistent when student cohorts were tracked longitudinally.

A majority of states have reported gains on state tests 
and a general closing of gaps, but these increases need
to be viewed cautiously. For example, because of the
pervasive problem of test-score inflation (i.e., score 
gains that overstate the underlying gains in genuine 
learning) that can occur when teachers and students be- 
come increasingly familiar with the content and format 
of state tests, researchers prefer to rely on NAEP as a 
source of more credible information about achievement 
progress. General increases on NAEP and gap closings 
have continued since 2002, especially in mathematics, 
but the improvements are much more modest than on 
state tests, and there is no difference in the rate of gain 
before or after NCLB. A fair conclusion from all of
these studies might be to say that, since 1992, the era of 
test-based accountability has been associated with in- 
creasing student achievement, but improvements have 
not been as clear-cut or dramatic as had been hoped and 
cannot be attributed solely to accountability policies. Although
the trend continues to be positive, the intensification
of pressures since NCLB has not produced commensurately
greater gains.

Studies showing positive changes in instructional prac- 
tices because of accountability have also documented 
significant negative effects. For example, it is again the 
case that tests have had a stronger impact on teaching 
than standards. Tested subjects receive much more in- 
structional time than non-tested subjects, driving out art, 
music, and physical education, but also reducing time for 
science and social studies, especially for disadvantaged 
and minority students assigned to increased doses of 
reading and mathematics. Citizens and policy makers
are generally aware of the problem that teachers face 
strong incentives to emphasize content that is tested, 
which in some cases can become so strong that they actually
“teach the test.” Many educators, parents, and policy
makers believe it reflects a necessary trade-off,
arguing that reading and math are the most essential
skills and must be mastered even at the expense of other 
learning. However, research on teaching the test shows 
that pressure to raise test scores changes not only what is 
taught but how it is taught. 

If teaching the test means practicing excessively on 
worksheets that closely resemble the test format, then it 
is possible for test scores to go up without there being a 
real increase in student learning. This problem of test 
score inflation can explain why scores are rising more 
dramatically on high-stakes state tests than on NAEP, although
it is difficult to estimate the exact amount of inflation.
More significantly for the students themselves,
the emphasis on rote drill and practice often denies stu- 
dents an opportunity to understand context and purpose 
that would otherwise enhance skill development. It is
much more interesting to work on writing skills, for ex- 
ample, after reading a book about Martin Luther King 
than it is to practice writing to test prompts. Some teach- 
ers also report focusing their efforts on bubble kids, 
those who are closest to the proficiency cut score, so that 
a small improvement can make a big difference in the 
school’s percent proficient number. These kinds of
problems (gaming, distortion, and perverse incentives)
are well known in the economics literature on incentives
and can be expected to occur when performance
indicators are imperfect measures of desired outcomes. 
How these problems can and should be weighed by edu- 
cators and policy makers is a thorny question for which 
the available findings on unintended consequences of 
test-based accountability provide useful but insufficient 
information. 



According to some policy advocates, standards-based re- 
forms have had limited success because underlying in- 
centive structures have not been well enough understood 
and implemented. For some content experts, especially 
in mathematics and science, the reforms have failed be- 
cause standards and assessments still do not reflect what 
is known from cognitive science about how conceptual 
learning develops in these fields. For other reformers, 
the promise that standards would ensure equity has been 
broken because sanctions have been imposed without 
sufficient support to make it possible for standards to be 
met, especially in the poorest schools. Each of these per- 
spectives has validity and, taken together, they can help 
us understand how it is that standards-based reforms 
have not yet been implemented in a way that adequately 
reflects original intentions. These imperfections and 
costs notwithstanding, policy makers feel an urgent need 
to use standards as a tool for improved education. In- 
sights about how reforms have fallen short can lead to 
improvements in the design and implementation of stan- 
dards and serve to leverage much needed reforms. 

Content Standards 

The Goals 2000: Educate America Act of 1994 defines 
content standards as “broad descriptions of the knowl- 
edge and skills students should acquire in a particular 
subject area.” For many states, the content standards
adopted a decade ago represented that state’s first effort 
at trying to develop some kind of curriculum framework. 
For most, the process was highly political — as well it 
should be in a democratic society. But without previous 
experience and access to coherent curricula representing 
particular curricular perspectives, the political solution of 
adding in everyone’s favorite content area topic created 
overly-full, encyclopedic standards in some states, or 
vague, general statements in others. It should thus be no 
surprise that today’s state content standards vary widely 
in coverage, rigor, specificity, and clarity despite the 
admonition in NCLB that all states should adopt “chal- 
lenging academic content standards” with “coherent and 
rigorous content.”

There is now considerable political support in the United 
States for new common standards. The National Gover- 
nors Association and the Council of Chief State School 
Officers are leading an effort to create shared high 
school graduation standards and grade-by-grade content
standards in math and language arts, and, almost immediately,
46 states signed on to the development effort.
It is hoped that common standards will be both more rig- 
orous and better focused, but this politically important 
effort will not necessarily produce a better result unless 
past problems are better understood. In a recent workshop
held by the National Research Council that addressed
common standards, presenters argued that the
development process for standards has often left out
more complex, discipline-based expertise about how
knowledge, skills, and conceptual understanding can be
developed together in a mutually reinforcing way. A recent
study of language arts, science, and mathematics
content standards in 14 states, for example, found only 
low to moderate alignment between state standards and 
corresponding standards defined by national professional 
organizations (e.g., the National Council of Teachers of 
Mathematics). There is a strong research base documenting
how students develop advanced proficiencies in
science and mathematics, and correspondingly what
pedagogical practices tend to support such learning.
But, in the standards negotiation process, these more 
complex understandings are likely to be replaced by in- 
clusive but disorganized lists of topics. Research on edu- 
cation reforms has clearly documented the need for cur- 
ricular coherence to make sure that the pieces of reform 
work together, provide support for teacher learning, and 
convey consistent messages to students. Curricular co- 
herence can refer both to the ways that policy instru- 
ments fit together — standards, assessment, and profes- 
sional development — and to features of the curriculum 
itself. Although most states and districts have attended 
to issues of alignment among standards, assessments, 
and textbooks, these are skeletal match-ups — outlines of 
similar topics — that do not address deeper issues of con- 
ceptual congruence between challenging curricular goals 
and the underlying structure of prerequisite topics and 
skills needed to achieve them. It is important to recog- 
nize that broad content standards, at least as developed 
thus far in the United States, do not have the specificity 
of curricula as typically developed in other countries, 
where there is greater clarity about the depth of coverage 
and the appropriate sequencing of topics. Proponents of 
common standards will need to clarify whether content 
standards will continue to be curriculum frameworks,
intended as rough outlines of what should be taught, or
whether they will take on the tougher task of specifying common
grade-by-grade curricular goals.

Studies of the top-performing countries in the Third In- 
ternational Mathematics and Science Study (TIMSS) 
provide examples of coherent curricula. In contrast to 
U.S. state standards that appear to emphasize “rote 
memorization of particulars,” mathematics and science 
curricula in the Czech Republic, Japan, Korea, and Sin- 
gapore reflect a hierarchical sequencing of topics de- 
signed to move progressively toward more advanced top- 
ics and a deeper understanding of the structure of the 
discipline. Differences among the top-performing 
countries indicate that there is more than one way for
topics to be organized, but importantly, in each case
choices have been made so that each country’s curricu- 
lum is coherently organized with fewer topics per grade 
than the overwhelming and repetitive lists typically 
found in the United States. Although NCLB alignment 
requirements were intended to correct the “mile wide 
and inch deep” curriculum problems identified by 
TIMSS researchers, the most recent research indicates 
that these problems are largely unabated and also occur 
with English, language arts, and reading standards. A re- 
cent analysis of content standards from 14 states found 
that they failed to focus on a few big ideas and that they 
were not differentiated by grade so as to indicate how 
topics should build from one grade to the next. Of great- 
est relevance for a common standards effort, when con- 
tent standards were compared between pairs of states, 
there was an average of only 20 percent overlap in the 
topics and level of cognitive complexity intended to be 
taught at each grade. 

Many analysts have pointed to the national control of 
curriculum in top performing countries as a means to en- 
sure curricular coherence and correspondingly higher 
achievement. A finer-grained analysis, however, shows 
that national control is not required for coherence. 
Rather, coherence leads to effective outcomes if it is 
achieved at whatever level of governance has authority 
over policy instruments. In a study of the 37 nations in 
TIMSS, only 19 reported having a single, centralized 
educational system. Unfortunately, neither states nor dis- 
tricts in the United States have a tradition of curriculum 
development and associated teacher professional devel- 
opment like that of many other countries, whether at the 
national or provincial level. 

Performance Standards 

Complaints that standards differ widely from state to 
state often confuse content standards with performance 
standards. Whereas content standards refer to the 
knowledge and skills that students should acquire in a 
particular subject area, performance standards are “con- 
crete examples and explicit definitions of what students 
have to know and be able to do” to demonstrate proficiency
in the skills and knowledge outlined by the content
standards. Performance standards are best represented
by showing pieces of student work that illustrate
the quality of an essay or the demonstration of mastery 
that is expected. In practice, however, performance stan- 
dards are often expressed simply as cut scores on a test. 
For example, students might be required to get 85 per- 
cent of the items on a test correct to meet the proficiency 
standard. Unfortunately, the ideal of “concrete examples 
and explicit definitions” has been compromised by the 
use of cut scores so that the connection between performance
standards and content standards is much less
obvious and less transparent.

Typically, cut scores are set by panels of judges, usually 
community leaders as well as educators. Various stan- 
dard setting procedures are used to help panelists make 
their judgments; basically the process involves asking 
each panelist what percentage of items should be an- 
swered correctly to demonstrate proficiency, and com- 
puting the average of these cut scores across panelists. 
The procedures for setting cut scores are not scientific, 
and do not lead to the estimation of some true profi- 
ciency standard. Results can vary dramatically depend- 
ing on whether judges are shown multiple-choice or 
open-ended items and whether they are asked to set 
“world class” or grade-level passing standards. Not 
surprisingly, some states have thus set proficiency stan- 
dards that only 15 percent of their students can pass and 
others have set standards that 90 percent of their students 
can pass. A recent study commissioned by the National 
Center for Education Statistics verified that these differ- 
ences were caused by differences in the stringency of the 
standards, not by real differences in student performance.
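
The averaging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the panelist judgments are invented numbers, and `panel_cut_score` and `is_proficient` are hypothetical helpers, not a description of any state's actual standard-setting procedure. It shows how the same test can yield very different proficiency standards depending on what the panel is asked to judge.

```python
from statistics import mean


def panel_cut_score(judgments):
    """Average the panelists' judged percent-correct values into one cut score.

    `judgments` holds one percentage (0-100) per panelist: the share of
    items that panelist thinks a proficient student should answer
    correctly. All values used below are hypothetical.
    """
    if not judgments:
        raise ValueError("at least one panelist judgment is required")
    return mean(judgments)


def is_proficient(percent_correct, cut_score):
    """A student meets the standard when the score reaches the cut score."""
    return percent_correct >= cut_score


# Two hypothetical panels judging the same test under different instructions.
grade_level_panel = [55, 60, 65, 60]
world_class_panel = [80, 85, 90, 85]

grade_level_cut = panel_cut_score(grade_level_panel)
world_class_cut = panel_cut_score(world_class_panel)

# The same student performance is "proficient" under one standard
# and not under the other.
student_percent_correct = 70
meets_grade_level = is_proficient(student_percent_correct, grade_level_cut)
meets_world_class = is_proficient(student_percent_correct, world_class_cut)
```

In this sketch a student answering 70 percent of items correctly passes the first standard and fails the second, even though nothing about the student's performance has changed.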

Reporting improvements in percent proficient has been 
the standard metric for tracking progress since the be- 
ginning of the standards movement. But, percent profi- 
cient does not tell us much if proficiency is defined so 
differently by states. And, reporting in relation to a pro- 
ficiency cut score has created the problem of focusing on 
bubble kids, which would not occur if indices were
used that accounted for the status and growth of every 
student. In addition, statisticians have demonstrated that 
comparing proficiency percentages can create a very 
misleading picture of whether gaps are actually shrinking 
for the majority of students in the reporting groups. For 
example, if cut scores are set either very high or very 
low, the gaps between groups appear to be small; con- 
versely, the gaps between groups appear quite large 
when cut scores are set in the middle of the test score 
range. Because of limitations of the percent proficient 
metric, most studies of achievement trends also use some 
other metric, such as effect sizes, to quantify achievement
changes over time.
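
The dependence of percent-proficient gaps on the location of the cut score can be illustrated with a small Python sketch. The assumptions here are ours, not the studies': two groups with normally distributed scores, equal spread, and means half a standard deviation apart. The function names are hypothetical. The underlying difference between the groups is held fixed, yet the reported gap changes with the cut score, while the effect size does not.

```python
from statistics import NormalDist

# Hypothetical score distributions for two groups on a standardized scale.
# The half-standard-deviation difference in means is assumed for illustration.
group_a = NormalDist(mu=0.5, sigma=1.0)
group_b = NormalDist(mu=0.0, sigma=1.0)


def percent_proficient(dist, cut_score):
    """Percentage of the group scoring at or above the cut score."""
    return 100.0 * (1.0 - dist.cdf(cut_score))


def proficiency_gap(cut_score):
    """Gap in percent proficient between the two groups at a given cut score."""
    return percent_proficient(group_a, cut_score) - percent_proficient(group_b, cut_score)


def effect_size():
    """Standardized mean difference; it does not depend on any cut score."""
    return (group_a.mean - group_b.mean) / group_b.stdev


low_gap = proficiency_gap(-2.5)   # very low cut score: nearly everyone passes
mid_gap = proficiency_gap(0.25)   # cut score near the middle of the score range
high_gap = proficiency_gap(3.0)   # very high cut score: almost no one passes
```

Under these assumptions the gap is under one percentage point at either extreme and close to twenty points in the middle, although the effect size stays at 0.5 throughout.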

Policy makers are also aware of the problem that tradi- 
tional status measures — which report the current 
achievement level for a given group of students — tend to 
reward schools serving affluent neighborhoods rather 
than creating incentives to ensure that all students re- 
ceive the help they need to make significant progress. 
Progress on status measures is then evaluated using successive
cohorts of students, for example by comparing
this year’s fourth graders to last year’s fourth graders. 
Schools in communities similar to Beverly Hills, Shaker 
Heights, and Scarsdale do relatively well with status 
measures, even if the quality of instruction may only be 
mediocre, because students from advantaged back- 
grounds enter school with higher levels of achievement 
and continue to receive additional resources from outside 
of school. 

Recent interest in growth measures and value-added 
models represents efforts by policy makers and technical
experts to try to create reporting metrics that are more 
likely to capture the educational contribution of specific 
schools and districts. Growth measures follow the same 
cohort of students across years, and are able to show, for 
example, how much fifth graders have gained compared 
to their performance as fourth graders the previous year. 
The amount of gain can then be evaluated depending 
upon whether students are gaining at a rate that is faster 
or slower than the typical rate. Value-added models are 
complex statistical procedures used in conjunction with 
growth data and are intended to quantify how much each 
teacher or school has contributed to a student’s growth in 
comparison to the average rate of growth. Although there 
are serious questions about whether research would sup- 
port the high-stakes use of value-added models to make 
decisions about individual teachers, it is clear that some 
indicator of student growth would add important infor- 
mation to accountability systems beyond that provided 
by status measures alone. 
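
The difference between a status measure and a growth measure can be sketched as follows. The scores, student identifiers, and the assumed typical gain are all invented for illustration, and the helper names are hypothetical. This is a simple gain-score calculation, not a value-added model, which would require considerably more statistical machinery.

```python
from statistics import mean

# Hypothetical scale scores for the same students in consecutive years.
# Keys are student identifiers; all numbers are invented for illustration.
grade4_last_year = {"s1": 210, "s2": 195, "s3": 230, "s4": 205}
grade5_this_year = {"s1": 222, "s2": 211, "s3": 238, "s4": 219}


def status(scores):
    """Status measure: the current average achievement level of a group."""
    return mean(scores.values())


def average_growth(before, after):
    """Growth measure: average gain for students tested in both years."""
    matched = before.keys() & after.keys()
    if not matched:
        raise ValueError("no students appear in both years")
    return mean(after[s] - before[s] for s in matched)


def growth_relative_to_typical(before, after, typical_gain):
    """Positive when the cohort gained faster than the assumed typical rate."""
    return average_growth(before, after) - typical_gain


current_status = status(grade5_this_year)
cohort_growth = average_growth(grade4_last_year, grade5_this_year)
# A typical yearly gain of 10 points is an assumption made for this sketch.
relative_growth = growth_relative_to_typical(grade4_last_year, grade5_this_year, 10)
```

Status reports only where this year's fifth graders stand, whereas the growth figure follows the same students and shows how far they moved from their own fourth-grade scores.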

Assessments 

Regardless of intentions, each new wave of educational 
reform has had to face the problem that high-stakes tests 
strongly influence what is taught. The authors of A Na- 
tion at Risk lamented the pernicious effect of minimum 
competency testing “as the ‘minimum’ tends to become 
the ‘maximum,’ thus lowering educational standards for 
all.” Yet, reform legislation in nearly every state sub- 
sequently mandated basic skills testing that perpetuated 
the problem of dumbed-down instruction. In the early 
1990s, advocates for standards used terms like authentic, 
direct, and performance-based to argue for fundamen- 
tally different kinds of assessments that would better rep- 
resent ambitious learning goals requiring complex analy- 
sis and demonstration of skills rather than just recall and 
recognition of answers. The idea of alignment between 
assessments and standards was meant to ensure that as- 
sessments would, indeed, measure learning goals repre- 
sented in the content standards. Unfortunately, in prac- 
tice, alignment has been claimed whenever test items fit 
somewhere within the standards framework rather than
asking the more important question: Do all of the test
items taken together reflect the full reach of what was intended
by the content standards?

In the most comprehensive study completed since 
NCLB, test items were compared to state content stan- 
dards in each of nine states in mathematics and English, 
language arts, and reading, and in seven states for sci- 
ence. On average across states, the content and cogni- 
tive demand in mathematics matched only 30 percent of 
the standards’ expectations at fourth grade and only 26 
percent at eighth grade. The corresponding figures in 
English, language arts, and reading were 19 percent 
(fourth grade) and 18 percent (eighth grade), and in sci- 
ence, 24 percent (fourth grade) and 21 percent (eighth 
grade). Some state assessments agreed with their own 
state content standards as much as 43 percent and others 
as little as 9 percent. Consistent with earlier critiques of 
standardized tests, this mismatch occurred primarily because
tests tapped lower levels of cognitive demand than
intended by the standards. Especially in mathematics,
three-quarters of tested content was at the procedural or
recall level of cognitive demand. Thus, standards-based
reform rhetoric has not yet produced the envisioned reforms
of assessments needed to measure higher order
thinking abilities — such as data analysis and generaliza- 
tion, reasoning from evidence and being able to identify 
faulty arguments, drawing inferences and making predic- 
tions, or the capacity to synthesize content and ideas 
from several sources. 

Teacher Professional Development 

It was recognized at the beginning of the standards 
movement that teaching much more challenging curric- 
ula to all students was a tall order. It would mean provid- 
ing all students with rich and engaging instructional ac- 
tivities that previously had been offered only to more 
academically advanced students. Because this vision 
would require fundamental changes in instructional prac- 
tices, capacity building and teacher professional devel- 
opment were seen as key ingredients in support of re- 
forms. Unfortunately, these expectations were rarely 
translated into policy. Few states invested in training to
help teachers teach rigorous subject matter in engaging 
ways. Even in states like Kentucky, which invested in 
teacher professional development, training was limited.
In most cases, policy makers relied on the state tests to 
convey changes that were needed. Although this was 
sometimes straightforward — for example, adding writing 
tests increased the amount of instructional time devoted 
to writing — accountability tests did not help teachers 
learn how to teach for conceptual understanding. Recent 
surveys still document significant capacity issues at both
state and district levels. Teachers lack the training to
interpret data about their students and often do not know
how to adapt instruction for struggling students. They
also may not themselves know enough about the disci- 
pline they are teaching and about methods for teaching in 
that discipline (especially in the case of mathematics and 
science) to be able to teach in ways that are both engag- 
ing and conceptually deep. 

Solutions to the capacity problem are likely to be costly. 
According to recent summaries of evaluation studies, ef- 
fective professional development programs can be nei- 
ther brief nor superficial. Effective programs — those that 
changed teaching practices and improved student out- 
comes — focused on both content knowledge and particu- 
lar aspects of content mastery related to student learning; 
they were coherently linked to curricular expectations, 
involved the sustained participation of teachers over long 
periods of time, and allowed teachers the opportunity to 
try new methods in the context of their own practice.
The need to ensure that beginning teachers are ade- 
quately prepared to teach challenging curriculum is 
equally great. Studies of initial teacher preparation find, 
for example, that program features such as curriculum 
familiarity and supervised opportunities to gain experi- 
ence with specific classroom practices account for sig- 
nificant differences in the effectiveness of first-year 
teachers. 

School and System Accountability 

Although the vision of standards-based reform called for
the redirection of effort at every level of the educational 
system, accountability requirements have been focused 
primarily on the individual schools. The school as the lo- 
cus for improvement has a legitimate basis in research. 
Research on effective schools, for example, documents 
that schools with a sense of common purpose and em- 
phasis on academics can produce student achievement 
well above demographic predictions. But, this research
often relied on case studies of exceptional schools. 

It has become increasingly clear that poor-performing 
schools are not able to address the problems that reflect 
the larger context of which they are a part. Great inequi- 
ties exist among schools in resources, in the needs of the 
students they serve, and in the qualifications of the 
teachers and leaders they are able to attract and retain. 
Recent findings from the Programme for International 
Student Assessment (PISA) indicate that socioeconomic 
background factors have a much bigger impact on stu- 
dent performance in the United States than in most other 
countries. Research evidence is accumulating to suggest
that school inequities in the United States are exaggerating
existing socioeconomic differences and may be
contributing to Black-White and Latino-White gaps in
test scores. Schools are much more unequally funded in
the United States than in high-achieving nations. Furthermore,
large numbers of lower socioeconomic status
and minority children are attending increasingly segregated
schools, and such schools have difficulty recruiting
and retaining high-quality teachers and suffer other resource
limitations as well.

Studies focused specifically on the impacts of account- 
ability have documented that school-based accountability 
mechanisms can indeed be a formula for the rich getting 
richer. Better-situated schools serving higher socioeco- 
nomic neighborhoods with higher quality academic pro- 
gramming are more able to respond coherently to the 
demands of external accountability. High-performing
schools, for example, already have in place the kind of 
instruction that is needed and thus can redirect the effort 
of well-qualified teachers to make sure that all students 
are able to meet the standards. In contrast, schools with a 
high concentration of poor performing students have to 
try to put in place for the first time the kinds of academic 
structures that are needed but frequently those schools 
lack the expertise and resources to do so.^ 

As early as 1999, the National Research Council Com- 
mittee on Title I Testing and Assessment called attention 
to the problem of placing too great a weight on schools 
with limited capacity to respond. As was suggested a 
decade ago, this imbalance needs to be redressed. To be 
sure, teachers and school administrators should be held 
accountable for their part in improving student learning. 
But equally so, “districts and states should be held ac- 
countable for the professional development and support 
they provide teachers and schools to enable students to 
reach high standards.”^^ In addition, states are responsi- 
ble for redressing the large inequities among socioeco- 
nomic groups that exist before students enter school and 
that persist throughout. Researchers have estimated, for 
example, that fully half of the Black-White achievement 
gap that exists at twelfth grade could be erased by elimi- 
nating the differences that exist when children start 
school.^"* 

Recommendations 

Standards-based education is still the core idea guiding 
education policy and education reform. But the forego- 
ing issues need to be addressed if the promises of stan- 
dards-based education are to be kept. As yet, neither 
state content standards nor state tests reflect the ambitions
of standards-based reform rhetoric, and the link between
high expectations for all students and capacity building
has been almost forgotten. The intentions of
standards-based education — to focus greater attention on 
student learning, to ensure the participation and success 
of all students, and to provide guidance for educational 
improvement — are in the best interest of the country. We 
know enough to create a new generation of policies, 
tests, and curricula that will focus greater attention on 
learning and will reduce the amount of effort spent pre- 
paring students for tests that do not adequately reflect the 
conceptual goals of instruction. 



RECOMMENDATION 1: The federal government 
should encourage the redesign and clear connection 
of content and performance standards — and the cur- 
ricula, teacher training, and high-quality assessments 
to go with them — with the goal of developing clearly 
articulated statements of the expected progression of 
learning. Efforts to develop these components may 
involve partnerships among states, universities, 
groups of teachers, scholars, and the private sector. 



In a well-functioning, standards-based system, all of the 
components of effective instruction — teaching, curricu- 
lum, professional development, assessments — are keyed 
to the content standards. The standards are coherent and 
organized around important ideas, skills, and strategies. 

Curricula provide teachers and students with a roadmap 
for how to reach proficiency, acquiring and extending 
knowledge and skills along the way. Professional devel- 
opment is designed to help teachers move students to- 
ward mastery of the standards. Assessments fully repre- 
sent the standards and ask students to complete tasks that 
draw on and measure their knowledge of the content, 
procedural skills, understanding, and ability to apply 
what they know to new situations. In such a system, 
teachers do not have to try to decipher the meaning of 
poorly articulated standards, guess what will be stressed 
on assessments, or have their students practice narrow, 
test-like items. 

Some states create curriculum frameworks to help teach- 
ers in planning units of instruction. To be most useful, 
these frameworks should be built around established 
progressions for how students grow in comprehension 
and skill. Effective learning progressions reflect an un- 
derstanding of how children learn and what students al- 
ready know.^^ Although state standards have been in use 
since the late 1980s, and scholarly work on progressions 
has made significant strides in recent years, there has 
been little attention in the United States to incorporating 
the most up-to-date thinking about cognition and learning
progressions into curriculum materials and assessments.

It is possible for learning to progress in a number of dif- 
ferent ways. In this country, for example, fractions are 
taught before decimals. But in many other countries,
decimals are taught before fractions or at the same time.
It is possible that either approach may work, but which- 
ever it is, the progression and its underlying rationale 
and strategy must be carefully articulated and then sup- 
ported with instructional guidance and appropriate as- 
sessments.^® 

In most current practice, content standards are developed 
in each state and then turned over to test developers to 
construct state tests that are more or less consistent with 
the state standards (often less, as noted above). In con- 
trast, the National Research Council report on state sci- 
ence assessments proposed a fundamentally different ap- 
proach focused on coherence. A successful standards- 
based science assessment system would be horizontally 
coherent by having curriculum, instruction, and assess- 
ment all keyed to the same learning goals; vertically co- 
herent by having the classroom, school, school district, 
and state all working from the same shared vision for 
science education; and developmentally coherent by tak- 
ing account of how students’ understanding of science 
develops over time.®^ To achieve this kind of coherence, 
states may need to consider more integrated, concurrent 
ways of developing standards and assessments. For ex- 
ample, the intention of standards might be conveyed bet- 
ter if they were accompanied from the beginning by pro- 
totypes of assessment tasks. More importantly, the de- 
velopment of learning progressions across grades re- 
quires empirical testing. Learning progressions as they 
have been developed in other countries are based on re- 
search and professional judgment about the logical se- 
quence of skills and topics, followed by empirical verifi- 
cation as curricular materials are developed, tested, and 
revised.®* 

President Obama said in March 2009 that he wanted 
states to adopt “tougher, clearer” standards that rival 
those in countries where students out-perform their U.S. 
counterparts. He called on states to join consortia to
“develop standards and assessments that don’t simply
measure whether students can fill in a bubble on a test, but
whether they possess 21st-century skills like problem
solving and critical thinking and entrepreneurship and
creativity.”

With the governors and chief state school officers em- 
barked on a joint effort to develop common standards, 
consortia of states may still be needed to take the next
step of developing deeper grade-level curricula, models
of teacher professional development, and complementary
instructional resources, including illustrations of the
kinds of tasks that will demonstrate mastery of the
standards and detailed guidance on assessments.
These consortia might involve universities, professional 
associations, subject-matter experts, think tanks, or other 
entities, but an important distinction should be drawn be- 
tween the political process needed to achieve consensus 
and guide policy decisions versus the scientific expertise 
needed to develop and rigorously evaluate curricular ma- 
terials, instructional strategies, and assessments. We rec- 
ommend that the federal government help a number of 
these development projects get off the ground, because 
states do not have the resources or, in many cases, the 
expertise, to do this on their own. 

The question of standards and accountability in high 
schools is more complex than it is in elementary and 
middle schools. Students follow different academic paths 
through high school, with different ends in mind. In most 
other countries, high school examinations measure stu- 
dents’ knowledge of subject matter or their readiness for 
vocations. In the United States, by contrast, most states 
require students to pass general skill proficiency exams 
to graduate from high school, and also use the exam re- 
sults in complying with NCLB accountability provisions. 
High school exit exams are often unrelated to courses or 
curriculum and vary across states in the levels of skills 
measured from eighth to eleventh grade. 

In moving away from generic, basic skills tests at the 
high school level, policy makers are immediately con- 
fronted with the issue of what the curriculum should be, 
because higher-level critical-thinking skills are not con- 
tent free. Analytical reasoning and problem solving can- 
not be learned or assessed in the absence of challenging 
content. So, going deeper requires making choices about 
which content to cover and what specific content can be 
fairly assessed. Specifically, then, states or consortia of 
states will face the question of whether there should be 
one curriculum or multiple curricula to prepare students 
for college and the workplace. Common standards do not 
resolve the question as to whether there should be multi- 
ple ways of getting there. 

As states seek ways to develop more challenging curric- 
ula to engage students who have been ill-served by tradi- 
tional college preparatory courses, they may wish to con- 
sider and carefully evaluate career-related courses or cer- 
tification programs. Recent studies of some career sys- 
tems in Europe suggest that explicit career preparation 
programs can have important benefits, especially those 
that involve youth apprenticeships or a combination of
part-time work and formal schooling leading to an
occupational certificate. In some countries, these programs
have high completion rates and support more rapid
transitions to employment.^* Importantly, they could also be
a means for providing more authentic learning contexts,
which, according to cognitive science research studies,
increase learning by helping students draw connections
and see why things work the way they do. According to the
National Research Council Committee on Increasing High
School Students’ Engagement and Motivation to Learn, career
academies and other occupational-themed programs can
improve student motivation and engagement, but only if
academic courses are well structured to ensure a wide range
of competencies and are integrated well with meaningful
work placements.®^ Great care must be taken, however, to
avoid old notions of vocational education or dead-end
low-ability tracks.
In a comparative study, for example, dual system curric- 
ula in Austria, Germany, and Switzerland were found to 
have much greater depth of content in mathematics and 
science, greater integration of academic and applied con- 
tent, and higher demand for cross-disciplinary higher- 
order skills than typical high school curricula in the 
United States.®^ Furthermore, these features are associ- 
ated with academic performance in the European coun- 
tries that matched or exceeded U.S. performance. 

Ten states currently use end-of-course exams for
accountability purposes, and other states have plans to
implement them.®'’ These exams differ significantly,
however, from the course assessment systems in most
high-achieving nations. In contrast to most end-of-course tests
in the United States, high school assessments in Austra- 
lia, Finland, Hong Kong, the Netherlands, Singapore, 
Sweden, and the United Kingdom — among others — are 
generally developed by high school and college faculty
and comprise largely open-ended questions and prompts 
that require considerable writing, analysis, and demon- 
stration of reasoning. Most also include intellectually 
ambitious tasks that students complete during the course, 
such as science investigations, research papers, and other 
projects that require planning and management as well as 
the creation of more extensive products. These tasks are 
incorporated into the examination score. Finally, the ex- 
aminations are used to inform course grades and college 
admissions, rather than to serve as exit exams from high 
school, which allows them to reflect more ambitious
standards.®®

Advances in assessment may be undertaken by consortia 
of states — as suggested above — that could work on high 
school standards, curricula, and related exams. Although 
we are not yet able to evaluate the quality and rigor of its 
products, efforts by Achieve to create a common
end-of-course exam for algebra for participating states are one
example of this kind of approach.®® A similar effort, per- 
haps involving employers, could be undertaken to de- 
velop certification standards and exams for different ca- 
reers. Indeed, such certification may provide an incentive 
for students to stay in school. Because a significant pol- 
icy question to be addressed is whether students benefit 
most from a single college-preparatory curriculum or a 
combination of college preparatory and career prepara- 
tion options, federal investments should be made in both 
types of curricular models to allow for rigorous, com- 
parative evaluations of the two systems. 



RECOMMENDATION 2: The federal government 
should support research on accountability system
indicators to reflect both the status and growth of
students. Performance standards should set ambitious
but realistic targets for teaching and learning, and
they should communicate to the public, parents, edu- 
cators, and students themselves what is to be learned. 
Assessment results should be reported in ways that 
recognize progress all along the achievement contin- 
uum. 



In order to have a constructive influence on the behavior 
of students, educators, and schools, accountability indi- 
cators must clearly communicate both what is to be 
learned and how students are progressing in their learn- 
ing, as well as illustrating what it means to be proficient. 
Accountability reporting systems also should indicate 
how much students are improving on a range of indica- 
tors of learning, how they are progressing through school 
to graduation, and what kinds of resources are available 
to them. When targets are set to help in evaluating the 
rate of growth, they should be ambitious but reachable. 
As basic as these criteria might seem, many state ac- 
countability systems do not meet them at present. 

The problems with judging schools or teachers using 
only a percent proficient criterion are now much better 
understood. Prior to standards-based reform, schools 
were judged by whether their test scores were above or 
below average. Comparisons based on averages were 
unpopular with policy makers because they implied 
complacency with being mediocre and did not provide 
any substantive insight into desired levels of perform- 
ance. Now that the weaknesses of proficiency scores are 
also understood, alternative metrics should be consid- 
ered. For example, reporting gains by comparing means 
is the reporting unit preferred by statisticians because 
means take account of all of the student scores. And, for 
those who want additional comparative standards in order
to decide whether mean scores are “good enough,”
the meaning of averages can be augmented substantively
by benchmarking to international comparisons along
with sample tasks that illustrate performance capabilities 
at different levels. 

The requirement in NCLB that states define Adequate 
Yearly Progress (AYP) and use it as the basis for holding 
schools and districts accountable sounds eminently rea- 
sonable. Yet, AYP changed the meaning of “adequate” 
in a misleading way that threatens to undermine the 
credibility of the accountability system. In NCLB, the 
term adequate was not defined as normal or even exem- 
plary progress; rather, it was based on a calculation of 
the rate of progress needed to get to 100 percent profi- 
ciency by the year 2014. Even if the deadline was a long 
way off, 100 percent proficiency was not a reasonable 
goal. The improvement curve was very steep, especially 
for second-language learners and special education sub- 
groups who, by definition, need special help to partici- 
pate fully in regular instruction, but still were required to 
reach the same target. Critics of this aspect of the law 
have argued that standards-based reforms could establish 
much more ambitious goals than have previously been 
achieved, especially for low-performing and at-risk 
groups, but should nonetheless set targets that are realis- 
tic. The idea of an existence proof criterion^^ means that 
there should be at least one example of a school or a dis- 
trict that achieved an aspirational goal before it can be 
mandated for everyone. For example, states might set 
test score targets at the 75th percentile or even the 90th 
percentile of what similar schools had achieved. This 
idea uses norms to help decide what is reasonable, but 
substantially ratchets up expectations rather than assum- 
ing that the 50th percentile remains a satisfactory goal. 
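
The contrast between the two ways of setting targets can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the straight-line trajectory toward 2014, the nearest-rank percentile rule, and every number in it are assumptions made for this example, not figures taken from NCLB regulations or from any state.

```python
import math

def ayp_annual_gain(current_pct, current_year, deadline_year=2014, goal_pct=100.0):
    """Percentage points per year needed to reach goal_pct by deadline_year,
    assuming (for illustration only) equal yearly increments."""
    years_left = deadline_year - current_year
    if years_left <= 0:
        raise ValueError("deadline has already passed")
    return (goal_pct - current_pct) / years_left

def percentile_target(similar_school_scores, percentile):
    """Nearest-rank percentile of what similar schools actually achieved."""
    if not similar_school_scores:
        raise ValueError("need at least one comparison school")
    ordered = sorted(similar_school_scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical school: 40 percent proficient in 2009.
required_gain = ayp_annual_gain(40.0, 2009)

# Hypothetical percent-proficient results for ten similar schools.
peers = [28, 33, 35, 38, 41, 44, 47, 52, 58, 66]
target_75 = percentile_target(peers, 75)
target_90 = percentile_target(peers, 90)

print(f"Gain needed each year under a 2014 deadline: {required_gain} points")
print(f"Target at 75th percentile of similar schools: {target_75}")
print(f"Target at 90th percentile of similar schools: {target_90}")
```

Under these made-up numbers, the deadline-driven calculation demands a fixed yearly gain regardless of whether any school has achieved it, whereas the percentile targets are, by construction, levels that at least some similar schools have already reached.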

Increasingly, states are aware that status measures of 
student performance reward schools that serve the most 
able students without necessarily reflecting the quality of 
education in those schools. At the same time, comparing 
schools based on similar demographics or growth ap- 
pears to set lower expectations for schools serving poor 
and minority communities. By examining both indica- 
tors, accountability systems can give credit for signifi- 
cant growth, and at the same time attend to the fact that 
desired performance goals still have not been met. 
Schools low on both status and growth measures should 
then receive the greatest scrutiny and assistance. 
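
A minimal sketch of how the two indicators might be read together follows. The cutoffs, labels, and school data are hypothetical values chosen for illustration; an actual system would need to set and validate them empirically.

```python
def classify_school(status_pct, growth_pts, status_goal=60.0, growth_goal=3.0):
    """Combine a status measure (percent meeting the performance goal) with a
    growth measure (yearly gain in points). Both goals are assumed values."""
    meets_status = status_pct >= status_goal
    meets_growth = growth_pts >= growth_goal
    if meets_status and meets_growth:
        return "goal met and growing"
    if meets_status:
        return "goal met, little growth"
    if meets_growth:
        return "significant growth, goal not yet met"
    return "low status and low growth: greatest scrutiny and assistance"

# Hypothetical schools: (status, growth)
schools = {
    "School A": (75.0, 1.0),
    "School B": (42.0, 6.5),
    "School C": (38.0, 0.5),
}
labels = {name: classify_school(s, g) for name, (s, g) in schools.items()}

for name, label in labels.items():
    print(f"{name}: {label}")
```

In this example School B receives credit for growth while still being flagged as short of the goal, and only School C falls into the category that would receive the most attention.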

Because tests and accountability indices based on tests 
are fallible, accountability reporting systems are less 
likely to lead to distortions and perverse incentives that 
focus effort on the wrong outcomes if they attend to
multiple sources of evidence. Reporting both growth and
status is one example of how the reporting system can be
improved to support more valid judgments about the 
quality of schooling. Other ways to improve an indicator 
system include using multiple measures of reading and
math (such as portfolios along with standardized tests),
measures of academic goals beyond reading and math, 
indicators of school climate and non-academic goals, and 
tracking of progress for significant subgroups. For
example, one of the most successful aspects of NCLB
reporting has been the disaggregation of test score results
and reporting of progress for each subgroup (e.g., major 
racial/ethnic groups, economically disadvantaged stu- 
dents, students with disabilities, English-language learn- 
ers, etc.).®^ Although there have been flaws in the specif- 
ics — especially the problems of small sample sizes and 
misleading impacts when the same at-risk students are 
counted multiple times in overlapping subgroups — the 
effort to focus attention on historically low-performing 
and neglected subgroups is generally regarded as one of 
NCLB’s greatest successes.

The federal government should support the development 
of several different kinds of reporting systems, and once 
in use they should be studied to see which approaches 
work the best for different purposes. For example, when 
Florida wanted to increase attention to learning gains for 
students scoring in the lowest 25 percent of students, 
they added this component to their formula for determin- 
ing school grades, and in 2010 they will add a compo- 
nent for high school students’ participation in accelerated 
coursework. Accountability reporting systems should be 
evaluated both in terms of the incentives they foster and 
the information they provide for improving instruction. 
States or consortia of states should recognize that this is 
a complex issue requiring careful analysis. Although the 
basic ingredients can be decided politically — whether 
graduation rates should be included, for example — the 
mechanics of how they are reported, whether they are 
compared to target or comparative criteria and whether 
they are combined into a single index should be worked 
out carefully and tested in trial sites. States would be 
well-advised to assemble teams of individuals with both 
content and technical expertise to assist them in this 
process. In general, we recommend that compensatory 
models be used rather than the current conjunctive model 
used in NCLB. Compensatory models allow for
strengths in one area to offset weaknesses in another 
area, at least to a certain degree, whereas with conjunc- 
tive policies, failing on any one dimension means failing 
on all of them.® Because of the concern that compensa- 
tory, composite indices could once again make it possi- 
ble for schools to use the performance of high-achieving 
students to hide the failure of low-performing students, 
reporting systems should require separate reporting by
subgroups or build in specific checks to guard against
this type of abuse. 
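
The difference between the two kinds of models, and the kind of subgroup check just described, can be sketched in a few lines. Everything here is an assumption made for illustration: the indicator names, weights, cutoffs, and subgroup floor are hypothetical and do not describe any state's actual formula.

```python
def conjunctive_pass(indicators, cutoffs):
    """Conjunctive rule: failing any single indicator means failing overall."""
    return all(indicators[name] >= cutoffs[name] for name in cutoffs)

def compensatory_pass(indicators, weights, composite_cutoff):
    """Compensatory rule: a weighted average lets strengths offset weaknesses."""
    total_weight = sum(weights.values())
    composite = sum(weights[n] * indicators[n] for n in weights) / total_weight
    return composite >= composite_cutoff

def subgroups_below_floor(subgroup_scores, floor):
    """Guard against a composite hiding low-performing groups: list every
    subgroup whose score falls below the floor."""
    return sorted(g for g, score in subgroup_scores.items() if score < floor)

# Hypothetical school-level indicators on a 0-100 scale.
indicators = {"reading": 72, "math": 55, "graduation": 80}
cutoffs = {"reading": 60, "math": 60, "graduation": 60}
weights = {"reading": 1, "math": 1, "graduation": 1}

passes_conjunctive = conjunctive_pass(indicators, cutoffs)
passes_compensatory = compensatory_pass(indicators, weights, composite_cutoff=60)

# Hypothetical math scores reported separately by subgroup.
math_by_subgroup = {"all students": 55, "English-language learners": 31,
                    "students with disabilities": 28}
flagged = subgroups_below_floor(math_by_subgroup, floor=40)

print(passes_conjunctive, passes_compensatory, flagged)
```

With these invented numbers the school fails the conjunctive rule because of math alone but passes the compensatory rule, while the separate subgroup check still surfaces the two groups whose results the composite would otherwise conceal.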



RECOMMENDATION 3: The federal government 
should support the redesign and ongoing evaluation 
of accountability systems to ensure that they contribute
to school improvement. Less than satisfactory
school performance should trigger closer investigation
of school operations before remedies or sanctions
are applied, and stellar performances should also be
verified. Different investigative approaches, including
audit assessments, data-driven analyses, or
expert-constituted inspectorates, should be considered.



It is now a familiar story in nearly every state to read
about schools that are “excellent” according to the state
accountability system but “in need of improvement” un-
der NCLB. Current NCLB policies identify so many
schools in the United States as failing that the number is
virtually meaningless. That number will continue to
grow as schools and districts fall short of the greater and
greater leaps that will be required to reach 100 percent
proficiency by 2014. Already, more than a third of U.S.
schools are not meeting their NCLB targets. Researchers
in California report that all of the state’s elementary
schools will eventually be considered low achievers.™
Such schools could be required, under current law, to of-
fer tutoring, allow students to transfer, have their faculty
and principal replaced, be turned into a charter school, or
be subject to other interventions.

Creating more appropriate accountability indices and us-
ing more scientific means to establish defensible targets,
as suggested above, will help to address the greatest
threats to validity and credibility of accountability sys-
tems. In addition, accountability systems should be de-
signed with built-in, self-checking mechanisms and
should be evaluated to determine whether the informa-
tion they provide and subsequent actions are, indeed,
improving the education system. Any measure of a
school’s performance only hints at what is going on in-
side its classrooms. Test scores alone cannot constitute
definitive evidence as to the extent of a school’s success
or failure. Suppose that test scores in mathematics for
students in one elementary school have risen dramati-
cally. How can we know if this is a result of test score
inflation or exemplary teaching practices? Should teach-
ers in that school receive bonuses and be visited by other
teachers who want to learn about their teaching strate-
gies? Or would it be better first to verify that students
from that school are doing well in middle school mathe-
matics and also succeed on an open-ended problem-solving
assessment administered by the district? What if
nearly all of the schools in a particular district show
growth rates for English-language learners below the
growth rate for similar students elsewhere in the state?
How should the locus of the problem be identified and
what should the mechanism be for marshalling the
needed resources if it is determined that neither schools
nor the district have the know-how to provide an ade-
quate remedy? It would be far too costly to try to collect
multiple measures of achievement and gather meaningful
data on curriculum materials, teacher qualifications, and
school climate and safety for every school every year.
But, more complete evidence could be collected to verify
successes and failures and their likely causes if in-depth
investigations were limited to a small sample of schools.


The federal government should support states that want 
to experiment with the development of two-stage ac- 
countability systems, whereby initial test-score results 
would serve as a trigger, prompting closer examination 
of performance and educational quality in targeted 
schools. Persistently low scores could prompt an evalua- 
tion designed to identify both educational and external 
factors that are influencing scores and would help to 
clarify whether the trends in scores really should be 
taken as a sign of program effectiveness.^* The kinds of 
evidence collected in the second stage could be some 
form of audit assessment, data-driven analyses of exist- 
ing data, or visits by expert inspectorates. We recom- 
mend, in particular, that the federal government encour- 
age the states to experiment with ways of introducing an 
element of human judgment into making decisions about 
which schools merit an aggressive turnaround effort as 
well as the substantive focus of such efforts. 

An important goal of a two-stage approach should be to 
ensure that attributions of exceptional or poor educa- 
tional quality are not made from test score data alone. In 
addition, gathering greater depth of information on a 
sample of schools will make it possible to evaluate the 
accountability system and continuously improve its 
measures and incentive structures. 



RECOMMENDATION 4: The federal government 
should support an intensive program of research and 
development to create the next generation of per- 
formance assessments explicitly linked to well- 
designed content standards and curricula. 



The field of educational measurement, closely tied to re- 
search in cognitive science, has already begun to develop 
major new solutions that will address several of the
substantive issues raised in this paper. With a significant,
targeted investment in developing new assessment tools, 
psychometric and content experts could, within a 5-10-year
period, provide a set of cognitive-science-based
tools to guide students and teachers in the course of 
learning. Development efforts should include systematic 
examination of assessments and curricula from high- 
achieving countries with special attention to assessment 
tasks that reflect higher-order thinking and performance 
skills. As new measures are developed, they could then, 
in turn, be tested for use in accountability systems with
particular attention to their ability to withstand distortion 
in high-stakes settings. 

The program of development and research we recom- 
mend here would create greater conceptual coherence 
between what is assessed externally for accountability 
purposes (e.g., at the state level) and the day-to-day as- 
sessments used in classrooms to move learning forward. 
Enough is known about potential new forms of assess- 
ment that an intensive engineering research program, 
with short cycles of development, field testing, and revi- 
sion could lead to dramatic improvements within a dec- 
ade. Such efforts would alter the nature of assessment in- 
formation, but more importantly would reshape for the 
better the ways that the character of assessments influ- 
ences the character of education. 

A national program of evidence-based assessment devel- 
opment should be launched as quickly as possible. The 
needed development and evaluative research should be 
carried out in multiple laboratories and field tryout sites. 
However, the program as a whole should be overseen by 
a single agency so as to promote maximum sharing of 
ideas, potential solutions, and interim results that might 
be tried out in operational assessments. A number of 
promising assessment tools already exist in prototype or 
experimental form. With appropriate federal government 
support, these tools might be brought together with cur- 
riculum development aimed at higher-level cognitive 
skills and then be moved into wider-scale tryouts, fol- 
lowed by refinements both to raise reliability and to scale 
back costs. The goals for such an effort should include 
the following: 

• Produce learning progressions and assessments 
that measure both content knowledge and higher- 
order, problem solving skills. It is relatively easy 
to measure content knowledge. But skills such as
adapting one’s knowledge to answer new and unfamiliar
questions are difficult to measure easily and
reliably. Moreover, for such assessments to be fair 
and useful, they must be tied to reasonably
well-documented learning progressions that demonstrate
how students’ increasing competence can be supported
and advanced. Learning progressions underlying
performance-based testing are deeply substantive and go
well beyond vertical alignment of traditional,
multiple-choice tests. In just the past decade,
significant progress has been made in the develop- 
ment of prototype learning progressions in the 
mathematics and science domains, but because 
each major concept, inquiry skill, and problem- 
solving strategy requires analysis and testing to en- 
sure that it can be used effectively in the classroom, 
a great deal of work remains to be done. Such a pro- 
gram of research cannot be focused on the design of 
new assessments in isolation; rather, it requires con- 
comitant investments in methodological, cognitive, 
and subject matter research as well. 

• Accurately and fairly assess English-language 
learners and students with special needs. In re- 
sponse to NCLB, the U.S. Department of Education 
funded state consortia to develop new measures of 
English language proficiency, which significantly 
advanced the field. States have adopted accommoda- 
tion policies to allow less impaired special education 
students to participate in regular state assessments, 
and have developed alternative assessments to assess 
more severely affected students. Basic psychometric 
issues of reliability have been addressed and pro- 
gress has been made in recognizing some of the 
most critical validity issues. For example, there is
now much greater understanding of the importance 
of assessing academic language to enable academic 
success for second-language learners. Much re- 
mains to be done, however, to fairly assess language 
development and to link academic language de- 
mands with progress in mastering specific curricu- 
lum content. Second-language learners are vastly 
different from one another in terms of the first lan- 
guages they speak and the amount of formal instruc- 
tion they have received in each of their languages. 
Each of these differences would require specially 
tailored assessments to ensure validity. Although it 
is true, for example, that any student is disadvan- 
taged if the vocabulary demands of a mathematics 
test do not align well with their textbook, the effects 
on validity and fairness for second-language learners 
are much greater. Therefore, more focused research 
and development is needed to examine the potential 
validity of both curriculum-linked assessments 
(which would allow second-language learners to 
take the same tests as their classmates) and of spe- 
cially tailored, individualized, classroom-based as- 
sessments that could be validly aggregated for large- 
scale accountability purposes. 
Similarly, newly developed alternative assessment
measures for students with disabilities have met their 
primary goal of including hitherto excluded students 
in the accountability system. But additional work is 
needed to make sure that accommodations do not ar- 
tificially inflate test performance and to verify the 
learning progressions underlying score scales on al- 
ternative assessments. Do gains on alternative as- 
sessments represent meaningful increments in 
knowledge and skills for these populations of stu- 
dents that can be used to set instructional goals? 

• Identify, test, and expand the availability of tech- 
nology to capture important learning goals, en- 
hance validity, and reduce the cost of assess- 
ments. Significant advances have been made in the 
use of technology to support traditional large-scale 
assessment programs, and important research and 
development programs are already under way. For 
example, the NAEP and the PISA have already done 
pilot studies of online assessments in mathematics 
and writing and in science, respectively. Computer 
scoring of essays is now as reliable as that by expert 
graders — at least for basic compositions — and is 
much less costly. Computerized adaptive testing is 
already used for high-stakes individual assessments 
on tests such as the Graduate Record Examination, 
and has great potential to support both large-scale 
and classroom-level applications of learning pro- 
gressions because it allows for each student to be 
tested in greater depth at his or her own particular 
level of mastery. Of special interest are break- 
through, next-generation assessments where tech- 
nology is helping to tap complex and dynamic as- 
pects of cognition and performance that previously 
could not be assessed directly. These cutting-edge 
developments show how technology can be used to 
engage students in developing models of scientific 
phenomena, analyzing data, and reasoning from evi- 
dence. This type of work is relatively new and has 
not been tested in wide-scale applications, but with 
further development and evaluation, these technol- 
ogy-based assessments could ultimately be used to 
allow large-scale testing of important analytical and 
problem-solving skills. 

• Develop valid and useful measures of classroom 
teaching and learning practices. Accountability 
systems drive what both students and teachers do. 
But changes in teaching practices are presently 
documented only by test scores, and research on 
teacher learning warns us that it is much easier to 
improve scores on traditional tests by teaching the 
test rather than by making the fundamental changes 
needed to improve students’ long-term learning. 
There is at present no direct way to measure changes 
in instruction that would withstand the requirements 
of high-stakes use. Although research on classroom 
observational instruments is limited, a great deal is 
known from cognitive science research about what 
to look for in classrooms to see if students are being 
supported to reach more ambitious learning goals, 
and measures of these features could be reliable 
enough to be useful for formative program im- 
provement and research purposes. For example, ob- 
servational ratings could document whether students 
are aware of learning goals, are actively engaged and 
take responsibility for their own learning, and 
whether it is part of classroom norms for students to 
be regularly called on to explain their thinking. 
Private foundations (e.g., Bill & Melinda Gates 
Foundation, William and Flora Hewlett Foundation, 
and William T. Grant Foundation) are now investing 
in the development of measures of teaching quality, 
and with federal assistance, much more substantial 
progress could be made toward solving these meas- 
urement questions. 

• Build greater conceptual coherence between as- 
sessments of student performance used for ac- 
countability purposes and classroom assessments 
designed to provide better instructional guidance 
to teachers. To be as accurate and useful as possi- 
ble, accountability tests should ask students to do 
tasks similar to those they are asked to do in their 
regular classrooms. But it would be a mistake to at- 
tempt to achieve coherence between large-scale and 
classroom assessments by locking in testing formats 
or dominant instructional patterns from the past cen- 
tury. We know from video studies of international 
and U.S. mathematics classes, for example, that in- 
struction in this country is dominated by practicing 
procedures and reviewing, in contrast to curriculum 
and instructional practices in other countries that are 
focused much more on depth of understanding, rea- 
soning, and the generalization of knowledge.^^ Thus, 
improving assessments in the United States neces- 
sarily requires corresponding improvements in cur- 
ricula and teaching. An intensive research program 
aimed at developing next-generation assessment and 
accountability systems must also undertake, in at 
least some research sites, reform of curricula, reform 
of instructional tasks, corresponding changes in 
large-scale assessments, and significant changes in 
both teacher preparation and professional develop- 
ment to help teachers teach in profoundly different 
ways. A next-generation system cannot be built all at 
once, but with federal research support, one consor- 
tium of states might pursue the design of curricula, 
instructionally linked learning progressions, and 
performance-based tasks in middle school science and 
try them out with accompanying teacher training. 
Another state or consortium might be funded to un- 
dertake similar development in writing and language 
arts, and so on. 

This ambitious research effort should be overseen by 
teams of the leading cognitive scientists, subject matter 
experts, and measurement professionals. The recommen- 
dation here is not merely to address the shortcomings of 
the current systems of standards, accountability, and as- 
sessment. Rather, the recommendation is to build on de- 
velopments in technology, assessment, and cognition to 
create a truly reformed accountability and measurement 
infrastructure to support the teaching and learning to 
which we aspire for all of our children. 